# DepMap Public 21Q4

##############
## Overview ##
##############

This DepMap release contains data from CRISPR knockout screens from Project Achilles, as well as genomic characterization data from the CCLE project.

###############
## Pipelines ##
###############

### Achilles

This Achilles dataset contains the results of genome-scale CRISPR knockout screens for Achilles (18017 genes in 925 cell lines) and Achilles combined with Project SCORE screens (17386 genes in 1054 cell lines). The dataset was processed using the following steps:

- Sum raw readcounts by replicate and guide
- Remove the list of guides with suspected off-target activity
- Remove guides with pDNA counts less than one millionth of the pDNA pool
- Remove replicates that fail fingerprinting match to parent or derivative lines
- Remove replicates with total reads less than 15 million
- Calculate reads per million, add pseudo-count of 1, then log2-fold-change from pDNA counts for each replicate
- Calculate the NNMD for each replicate using guides targeting the Hart reference non-essentials and the intersection of the Hart and Blomen essentials, and remove replicates with values more positive than -1.0. See Hart et al., Mol. Syst. Biol, 2014 and Blomen et al., Science, 2015
- Remove replicates that do not have a Pearson coefficient > 0.61 with at least one other replicate for the line when looking at genes with the highest variance (top 3%) in gene effect across cell lines. This is equivalent to excluding lines whose replicates have a p-value greater than 0.05 of being at least as highly correlated with a random line as with each other.
- Calculate the NNMD for each cell line after averaging remaining replicates, and remove those more positive than -1.0
- Run Chronos to generate gene-level scores
- Scale Chronos output so that median of common essentials over the whole dataset is -1.0
- Combine and batch correct the Achilles and Sanger SCORE gene scores into merged CRISPR datasets
- Identify pan-dependent genes as those for which 90% of cell lines rank the gene above a given dependency cutoff. The cutoff is determined from the central minimum in a histogram of gene ranks in their 90th percentile least dependent line
- For each Chronos/CRISPR gene score, infer the probability that the score represents a true dependency or not. This is done using an EM step until convergence independently in each cell line. The dependent distribution is given by the list of essential genes. The null distribution is determined from unexpressed gene scores in those cell lines that have expression data available, and from the Hart non-essential gene list in the remainder
- Replace all genes on X chromosome for cell lines that only have Broad SNP copy number data with NA in gene_effect.csv, gene_effect_unscaled.csv, and gene_dependency.csv
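
The read-count normalization step in the list above (reads per million, pseudo-count of 1, log2 fold change against pDNA) can be sketched as follows. This is a minimal illustration, not the Achilles pipeline code: the function names and toy counts are hypothetical, and it assumes the pseudo-count is added to both the replicate and the pDNA reads per million.

```python
import math


def reads_per_million(counts):
    """Scale raw guide counts so that they sum to one million."""
    total = sum(counts.values())
    return {guide: 1e6 * n / total for guide, n in counts.items()}


def log2_fold_change(replicate_counts, pdna_counts, pseudo_count=1.0):
    """Return log2((replicate RPM + pseudo) / (pDNA RPM + pseudo)) per guide."""
    rep_rpm = reads_per_million(replicate_counts)
    pdna_rpm = reads_per_million(pdna_counts)
    return {
        guide: math.log2(
            (rep_rpm[guide] + pseudo_count) / (pdna_rpm[guide] + pseudo_count)
        )
        for guide in rep_rpm
        if guide in pdna_rpm  # only guides present in the pDNA reference
    }


# Toy data: guide sequences and counts are made up for illustration.
pdna = {"AAAA": 500, "CCCC": 300, "GGGG": 200}
replicate = {"AAAA": 500, "CCCC": 100, "GGGG": 400}
lfc = log2_fold_change(replicate, pdna)
```

In this toy example the guide whose share of reads is unchanged gets a fold change of zero, the depleted guide is negative, and the enriched guide is positive.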

The source for copy number data varies by cell line. Copy number data indicated as "Sanger WES" are based on the Sanger Institute whole exome sequencing data (COSMIC: http://cancer.sanger.ac.uk/cell_lines, EGA accession number: EGAD00001001039) reprocessed using CCLE pipelines. Copy number source was chosen according to the following logic:

- Broad WES for lines where available
- Broad SNP when Broad WES is not available and either Sanger WES is not available or Sanger WES copy number has less correlation with log-fold change than Broad SNP
- Sanger WES in all other cases
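
The selection logic above can be written as a small decision function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the two correlation values are taken as precomputed numbers (how the correlation with log-fold change is measured is not specified here), and Broad SNP data is assumed to exist whenever it is chosen.

```python
def choose_cn_source(has_broad_wes, has_sanger_wes, sanger_corr=None, snp_corr=None):
    """Pick a copy number source for one cell line.

    sanger_corr / snp_corr: hypothetical precomputed correlations of each
    source's copy number with log-fold change (None if not available).
    """
    if has_broad_wes:
        return "Broad WES"
    if not has_sanger_wes:
        return "Broad SNP"
    # Both correlations must be known to prefer Broad SNP over Sanger WES.
    if sanger_corr is not None and snp_corr is not None and sanger_corr < snp_corr:
        return "Broad SNP"
    return "Sanger WES"
```

For example, a line with no Broad WES, Sanger WES available, and a lower Sanger correlation than SNP correlation would be assigned `"Broad SNP"`.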

Details about data processing are published on bioRxiv here: https://www.biorxiv.org/content/10.1101/720243v1 

### Expression

CCLE expression data is quantified from RNAseq files using the GTEx pipelines. A detailed description of the pipelines and tool versions can be found here: https://github.com/broadinstitute/ccle_processing#rnaseq. We provide a subset of the data files output by this pipeline, which is available on FireCloud. These are aligned to hg38.

### Copy number

CCLE WES copy number data is generated by running the GATK copy number pipeline aligned to hg38. Tutorials and descriptions of this method can be found here: https://software.broadinstitute.org/gatk/documentation/article?id=11682 and https://software.broadinstitute.org/gatk/documentation/article?id=11683. WES samples have been realigned to hg38 and run through this pipeline.

### Mutations

CCLE mutation calls are aggregated from several different sources and sequencing technologies.

### Fusions

CCLE generates RNAseq based fusion calls using the STAR-Fusion pipeline. A comprehensive overview of how the STAR-Fusion pipeline works can be found here: https://github.com/STAR-Fusion/STAR-Fusion/wiki. We run STAR-Fusion version 1.6.0 using the plug-n-play resources available in the STAR-Fusion docs for gencode v29. We run the fusion calling with default parameters except we add the --no_annotation_filter and --min_FFPM 0 arguments to prevent filtering.

### Omics Updates

[Changes to cell lines] For ACH-001011 and ACH-001108 we only had Hybrid Capture data. These two lines are known to be contaminated according to Cellosaurus. We removed both lines from our omics data. Furthermore, there was only an older version of copy number available for ACH-002335 with no other accompanying datasets or source file. We dropped this data to keep the copy numbers harmonized across cell lines. We also dropped the CNV data for ACH-002512 due to low quality scores.

[CN gene names] For gene level copy number data we updated the version of the Ensembl gene names to be consistent with the version (i.e. Ensembl Archive Release 102, Nov 2020) currently used in our gene expression data.

###########
## Files ##
###########

### README.txt

Description of all files contained in this release

### Achilles_gene_effect.csv

Pipeline: Achilles

_Post-Chronos_

Chronos data, copy number corrected. 

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs) 
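
Column labels in the "HUGO (Entrez)" format can be split into a symbol and an integer ID when loading the matrix. The sketch below uses only the Python standard library; the helper name, the first-column header, and the toy CSV values are made up for illustration and are not taken from the release files.

```python
import csv
import io
import re

# Matches labels such as "TP53 (7157)": a symbol, a space, then the ID in parentheses.
GENE_LABEL = re.compile(r"^(?P<symbol>.+?) \((?P<entrez>\d+)\)$")


def parse_gene_label(label):
    """Split a 'HUGO (Entrez)' column label into (symbol, entrez_id)."""
    match = GENE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected gene label: {label!r}")
    return match.group("symbol"), int(match.group("entrez"))


# Toy stand-in for a gene effect file; the scores are invented.
toy_csv = "DepMap_ID,A1BG (1),TP53 (7157)\nACH-000001,0.02,-0.85\n"
reader = csv.reader(io.StringIO(toy_csv))
header = next(reader)
genes = [parse_gene_label(column) for column in header[1:]]
```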

### Achilles_gene_dependency.csv

Pipeline: Achilles

_Post-Chronos_

Probability that knocking out the gene has a real depletion effect using gene_effect.

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)

### Achilles_common_essentials.csv

Pipeline: Achilles

_Post-Chronos_

List of genes identified as pan-essentials using Chronos

### Achilles_guide_efficacy.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- sgrna (nucleotides)
- efficacy: Chronos inferred efficacy for the guide

### Achilles_cell_line_efficacy.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- avana: Chronos inferred efficacy for the cell line in Avana (Achilles) screens

### Achilles_cell_line_growth_rate.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- avana: Chronos inferred unperturbed growth rate for the cell line in Avana (Achilles) screens

### CRISPR_dataset_sources.csv

Pipeline: Achilles

_Post-Chronos_

Columns:

- cell lines (Broad IDs)
- dataset: source dataset for combined data (Achilles, Score, or Both)

### CRISPR_gene_effect.csv

Pipeline: Achilles

_Post-Chronos_

Combined Achilles and Sanger SCORE Chronos data using Harmonia (the batch correction pipeline described here: https://www.biorxiv.org/content/10.1101/2020.05.22.110247v3) 

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)

### CRISPR_gene_dependency.csv

Pipeline: Achilles

_Post-Chronos_

Probability that knocking out the gene has a real depletion effect using CRISPR_gene_effect.

- Columns: genes in the format "HUGO (Entrez)"
- Rows: cell lines (Broad IDs)

### CRISPR_common_essentials.csv

Pipeline: Achilles

_Post-Chronos_

List of genes identified as dependencies in all lines, one per line.

### common_essentials.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes used as positive controls, intersection of Blomen (2015) and Hart (2014) essentials in the format "HUGO (Entrez)". Each entry is separated by a newline. The scores of these genes are used as the dependent distribution for inferring dependency probability.

### nonessentials.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes used as negative controls (Hart (2014) nonessentials) in the format "HUGO (Entrez)". Each entry is separated by a newline.

### Achilles_raw_readcounts.csv

Pipeline: Achilles

_Pre-Chronos file_

Summed counts for each replicate/pDNA

- Columns: replicate/pDNA IDs
- Rows: Guides (nucleotides)

### Achilles_raw_readcounts_failures.csv

Pipeline: Achilles

_Pre-Chronos file_

Summed counts for each replicate failing quality control checks

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_logfold_change.csv

Pipeline: Achilles

_Pre-Chronos file_

Post-QC log2-fold change (not ZMADed)

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_logfold_change_failures.csv

Pipeline: Achilles

_Pre-Chronos file_

Post-QC log2-fold change (not ZMADed) for cell lines failing quality control checks

- Columns: replicate IDs
- Rows: Guides (nucleotides)

### Achilles_guide_map.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- sgrna (nucleotides) - appears more than once
- genome_alignment
- gene ("HUGO (Entrez)")
- n_alignments (integer number of perfect matches for that guide)

### Achilles_replicate_map.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- replicate_ID (str)
- Broad_ID
- pDNA_batch (int): indicates which processing batch the replicate belongs to and therefore which pDNA reference it should be compared with.
- passes_QC (str): indicates if the replicate was included in Chronos calculations 

### Achilles_replicate_QC_report_failing.csv

Pipeline: Achilles

_Pre-Chronos file_

Rows: replicate IDs

Columns:

- failure_mode (reason replicate failed or NA)
- total_reads
- Pearson_corr_with_rep_A/B/C/D (Pearson correlation with sibling replicates)
- num_sibling_replicates_passing_QC (count of sibling replicates that passed)
- replicate_level_NNMD_pass (boolean indicating whether replicate passed NNMD threshold)
- replicate_level_NNMD (float)
- excluded_from_processing (boolean indicating if replicate was excluded from further QC processing)
- DepMap_ID
- FP_unknown (boolean indicating if fingerprinting status was unknown)
- can_include_in_dataset (boolean)

### Achilles_dropped_guides.csv

Pipeline: Achilles

_Pre-Chronos file_

Columns:

- sgrna (nucleotides) - appears more than once
- genome_alignment
- gene ("HUGO (Entrez)")
- n_alignments (integer number of perfect matches for that guide)
- fail_reason (why this guide is not used for gene effect/dependency calculation)

Note: in_dropped_guides = guide dropped for suspected off-target activity

### Achilles_high_variance_genes.csv

Pipeline: Achilles

_Pre-Chronos file_

List of genes with top 3% most variable scores across cell lines in 18Q4 gene_effect. Used for replicate correlation in quality control step.

### CCLE_RNAseq_reads.csv

Pipeline: Expression

RNAseq read count data from RSEM.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Ensembl ID)

### CCLE_expression_full.csv

Pipeline: Expression

RNAseq TPM gene expression data for all genes using RSEM. Log2 transformed, using a pseudo-count of 1.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Ensembl ID)
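
Because the values are stored as log2(TPM + 1), recovering linear TPM requires inverting the transform. A minimal sketch follows; the function names are illustrative and not part of any DepMap tooling.

```python
import math


def tpm_to_log2(tpm, pseudo_count=1.0):
    """Forward transform as described above: log2(TPM + pseudo-count)."""
    return math.log2(tpm + pseudo_count)


def log2_to_tpm(value, pseudo_count=1.0):
    """Inverse transform: recover TPM from a stored log2(TPM + pseudo-count) value."""
    return 2.0 ** value - pseudo_count
```

With a pseudo-count of 1, an unexpressed gene (TPM of 0) is stored as 0, so the transformed values are never negative.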

### CCLE_expression.csv

Pipeline: Expression

RNAseq TPM gene expression data for just protein coding genes using RSEM. Log2 transformed, using a pseudo-count of 1.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Entrez ID)

### CCLE_expression_transcripts_expected_count.csv

Pipeline: Expression

RNAseq read count data from RSEM.

- Rows: cell lines (Broad IDs)
- Columns: transcripts (HGNC symbol and Ensembl transcript ID)

### CCLE_expression_proteincoding_genes_expected_count.csv

Pipeline: Expression

RNAseq read count data from RSEM for just protein coding genes.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Entrez ID)

### CCLE_RNAseq_transcripts.csv

Pipeline: Expression

RNAseq transcript TPM data using RSEM. Log2 transformed, using a pseudo-count of 1.

- Rows: cell lines (Broad IDs)
- Columns: transcripts (HGNC symbol and Ensembl transcript ID)

### CCLE_segment_cn.csv

Pipeline: Copy number


Segment level copy number data

- DepMap_ID
- Chromosome
- Start (bp start of the segment)
- End (bp end of the segment)
- Num_Probes (the number of targeting probes that make up this segment)
- Segment_Mean (relative copy ratio for that segment)
- amplification status (+,-,0)

### CCLE_wes_segment_cn.csv

Pipeline: Copy number


Segment level copy number data from whole exome sequencing

- DepMap_ID
- Chromosome
- Start (bp start of the segment)
- End (bp end of the segment)
- Num_Probes (the number of targeting probes that make up this segment)
- Segment_Mean (relative copy ratio for that segment)
- amplification status (+,-,0)

### CCLE_gene_cn.csv

Pipeline: Copy number


Gene level copy number data, log2 transformed with a pseudo count of 1. This is generated by mapping genes onto the segment level calls.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Entrez ID)

### CCLE_wes_gene_cn.csv

Pipeline: Copy number


Gene level copy number data from whole exome sequencing, log2 transformed with a pseudo count of 1. This is generated by mapping genes onto the segment level calls.

- Rows: cell lines (Broad IDs)
- Columns: genes (HGNC symbol and Entrez ID)

### CCLE_fusions.csv

Pipeline: Fusions

Gene fusion data derived from RNAseq data. Data is filtered by removing the following:

- Fusions involving mitochondrial chromosomes or HLA genes
- Common false positive fusions (red herring annotations as described in the STAR-Fusion docs)
- Recurrent fusions observed in CCLE across cell lines (in 10% or more of the samples)
- Fusions where SpliceType="INCL_NON_REF_SPLICE" and LargeAnchorSupport="NO_LDAS" and FFPM < 0.1
- Fusions with FFPM < 0.05
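
The filters listed above can be sketched as a single predicate over one fusion record. This is an illustration only, not the CCLE filtering code. It assumes each record is a dict keyed by STAR-Fusion-style column names (`LeftGene`, `RightGene`, `LeftBreakpoint`, `RightBreakpoint`, `SpliceType`, `LargeAnchorSupport`, `FFPM`) plus `CCLE_count`; the `is_red_herring` flag is a hypothetical precomputed field, and the "chrM" and "HLA-" prefix checks are simplifying assumptions.

```python
def keep_fusion(fusion, n_samples):
    """Return True if a fusion record survives all of the filters listed above."""
    genes = (fusion["LeftGene"], fusion["RightGene"])
    breakpoints = (fusion["LeftBreakpoint"], fusion["RightBreakpoint"])
    if any(bp.startswith("chrM") for bp in breakpoints):
        return False  # mitochondrial chromosome
    if any(gene.startswith("HLA-") for gene in genes):
        return False  # HLA gene
    if fusion["is_red_herring"]:
        return False  # hypothetical flag for red herring annotations
    if fusion["CCLE_count"] >= 0.10 * n_samples:
        return False  # recurrent in 10% or more of the samples
    if (
        fusion["SpliceType"] == "INCL_NON_REF_SPLICE"
        and fusion["LargeAnchorSupport"] == "NO_LDAS"
        and fusion["FFPM"] < 0.1
    ):
        return False
    if fusion["FFPM"] < 0.05:
        return False
    return True


# Toy record: coordinates and counts are made up for illustration.
example = {
    "LeftGene": "BCR", "RightGene": "ABL1",
    "LeftBreakpoint": "chr22:100:+", "RightBreakpoint": "chr9:200:+",
    "is_red_herring": False, "CCLE_count": 3,
    "SpliceType": "ONLY_REF_SPLICE", "LargeAnchorSupport": "YES_LDAS",
    "FFPM": 0.8,
}
```

Calling `keep_fusion(example, n_samples=1000)` keeps this toy record, while the same record with `FFPM` set to 0.01 would be dropped.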

Column descriptions can be found in the STAR-Fusion wiki, except for CCLE_count, which indicates the number of CCLE samples that have this fusion.

### CCLE_fusions_unfiltered.csv

Pipeline: Fusions

Gene fusion data derived from RNAseq data. Data is unfiltered. Column descriptions can be found in the STAR-Fusion wiki.

### CCLE_mutations.csv

Pipeline: Mutations

MAF of gene mutations.

For all columns with AC, the allelic ratio is presented as [ALTERNATE:REFERENCE].

- CGA_WES_AC: the allelic ratio for this variant in all our WES/WGS (exon only) using a cell line adapted version of the 2019 CGA pipeline that includes germline filtering.
- SangerWES_AC: in Sanger WES (called by Sanger) (legacy)
- SangerRecalibWES_AC: in Sanger WES after realignment at Broad (legacy)
- RNAseq_AC: in Broad RNAseq data from the CCLE2 project (legacy)
- HC_AC: in Broad Hybrid capture data from the CCLE2 project (legacy)
- RD_AC: in Broad Raindance data from the CCLE2 project (legacy)
- legacy_wgs_exon_only: in Broad WGS data from the CCLE2 project (legacy)
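
An `[ALTERNATE:REFERENCE]` value can be turned into read counts and an alternate allele fraction as sketched below. The helper name is hypothetical, and the sketch assumes each non-empty cell holds exactly two integers separated by a colon; empty cells are treated as missing.

```python
def parse_allele_counts(ac):
    """Parse 'ALT:REF' into (alt, ref, alt_fraction); return None for empty cells."""
    if not ac:
        return None
    alt_text, ref_text = ac.split(":")
    alt, ref = int(alt_text), int(ref_text)
    depth = alt + ref
    # Avoid dividing by zero when no reads cover the site.
    fraction = alt / depth if depth else None
    return alt, ref, fraction
```

For instance, a value of `12:36` corresponds to 12 alternate reads, 36 reference reads, and an alternate allele fraction of 0.25.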

Additional columns:

- isTCGAhotspot: is this mutation commonly found in TCGA
- TCGAhsCnt: number of times this mutation is observed in TCGA
- isCOSMIChotspot: is this mutation commonly found in COSMIC
- COSMIChsCnt: number of samples in COSMIC with this mutation
- ExAC_AF: the allelic frequency in the Exome Aggregation Consortium (ExAC)

Descriptions of the remaining columns in the MAF can be found here: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

### CCLE_mutations_bool_hotspot.csv

Pipeline: Mutations

Boolean matrix indicating for each cell line whether each gene has at least one hotspot mutation

### CCLE_mutations_bool_damaging.csv

Pipeline: Mutations

Boolean matrix indicating for each cell line whether each gene has at least one damaging mutation

### CCLE_mutations_bool_nonconserving.csv

Pipeline: Mutations

Boolean matrix indicating for each cell line whether each gene has at least one other nonconserving mutation

### CCLE_mutations_bool_otherconserving.csv

Pipeline: Mutations

Boolean matrix indicating for each cell line whether each gene has at least one other conserving mutation

### sample_info.csv

Cell line information definitions

- DepMap_ID: Static primary key assigned by DepMap to each cell line
- stripped_cell_line_name: Cell line name with alphanumeric characters only
- CCLE_Name: Previous naming system that used the stripped cell line name followed by the lineage; no longer assigned to new cell lines
- alias: Additional cell line identifiers (not a comprehensive list)
- COSMIC_ID: Cell line ID used in the COSMIC cancer database
- lineage, lineage_subtype, lineage_sub_subtype, lineage_molecular_subtype: Cancer type classifications in a standardized form
- sex: Sex of tissue donor if known
- source: Source of cell line vial used by DepMap
- Achilles_n_replicates: Number of replicates used in Achilles CRISPR screen passing QC
- cell_line_NNMD: Difference in the means of positive and negative controls normalized by the standard deviation of the negative control distribution
- culture_type: Growth pattern of cell line (Adherent, Suspension, Mixed adherent and suspension, 3D, or Adherent (requires laminin coating))
- culture_medium: Medium used to grow cell line
- cas9_activity: Percentage of cells remaining GFP negative on days 12-14 of Cas9 activity assay as measured by FACS
- RRID: Cellosaurus research resource identifier
- sample_collection_site: Tissue collection site
- primary_or_metastasis: Indicates whether tissue sample is from primary or metastatic site
- disease: General cancer lineage category
- disease_subtype: Subtype of disease; specific disease name
- age: If known, age of tissue donor at time of sample collection
- Sanger_model_ID: Sanger Institute Cell Model Passport ID
- additional_info: Further information about cell line modifications and drug resistance
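
The `cell_line_NNMD` definition above (difference in the means of positive and negative controls, normalized by the standard deviation of the negative controls) can be written out directly. This is a sketch, not the pipeline implementation: the function names and toy scores are hypothetical, and the use of the sample standard deviation is an assumption. The -1.0 threshold is the one stated in the Achilles processing steps.

```python
from statistics import mean, stdev


def nnmd(positive_scores, negative_scores):
    """(mean of positive controls - mean of negative controls) / std of negatives."""
    # stdev() is the sample standard deviation; this choice is an assumption.
    return (mean(positive_scores) - mean(negative_scores)) / stdev(negative_scores)


def passes_nnmd(value, threshold=-1.0):
    """Values more positive than the threshold are removed."""
    return value <= threshold


# Toy scores, invented for illustration: essentials are depleted, non-essentials are not.
essential_scores = [-1.2, -0.8, -1.0]
nonessential_scores = [-0.1, 0.0, 0.1]
value = nnmd(essential_scores, nonessential_scores)
```

A more negative NNMD means better separation between the control distributions, so the toy line above would pass the threshold.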

